Abstract

The determination of Pass/Fail decisions over Borderline grades (i.e., grades which do not clearly distinguish
between competent and incompetent examinees) has been an ongoing challenge for academic institutions.
This study utilises the Objective Borderline Method 2 (OBM2) to determine examinee ability and item difficulty,
and from these to reclassify each Borderline grade as a Pass or Fail. Using the OBM2, examinees’ Borderline
grades from a clinical examination were reclassified into Pass or Fail. The predictive validity of this method was
estimated by comparing the original and reclassified examination grades to each other and to subsequent clinical
examination results. The new model was more stringent (p<.0001) than the original decisions.
Implications for educators and policy makers are discussed. The OBM2 is found to provide a plausible solution
for decision making over borderline grades in non-compensatory assessment systems.

Keywords: borderline grades, board of examiners, examinations, decision making

1. Introduction 

1.1 Standard Setting and Decision Making in Higher Education 

One of the most challenging tasks in clinical assessments is the Pass/Fail decision for borderline performance
(Kramer et al., 2003; Patricio et al., 2009; Schoonheim-Klein et al., 2009; Shulruf, Turner, Poole, & Wilkinson,
2013; Wood, Humphrey-Murto, & Norman, 2006). This challenge is particularly difficult since many types of
clinical examination include a “Borderline performance” category in their marking sheets (Boursicot, Roberts, & Pell,
2007; Roberts, Newble, Jolly, Reed, & Hampton, 2006; Schoonheim-Klein et al., 2009; Wilkinson, Newble, &
Frampton, 2001). As many clinical examinations are high-stakes (Shumway & Harden, 2003), it is essential to
make an accurate Pass/Fail decision which neither fails a competent student nor passes an incompetent
student, particularly given evidence that borderline students tend to remain underachieving throughout their
studies (Pell, Fuller, Homer, & Roberts, 2012). The General Medical Council (2009a, 2009b) has also expressed
concerns about assessment and standard setting practices in medical programmes in the United Kingdom. To
address this critical issue a plethora of standard setting methods have been introduced and implemented in a
range of clinical examinations (Boulet, De Champlain, & McKinley, 2003; Jalili, Hejri, & Norcini, 2011; Shulruf
et al., 2013; Wass, Vleuten, Shatzer, & Jones, 2001; Wilkinson et al., 2001). Nonetheless, despite this range of
methods, concerns about reliability, validity and acceptability remain (Ben-David, 2000; Brannick, Erol-Korkmaz,
& Prewett, 2011), particularly within the context of clinical assessment, where clinical examiners tend to avoid
failing students and trainees (Cleland, Knight, Rees, Tracey, & Bond, 2008; Dudek, Marks, & Regehr, 2005;
Morton, Cumming, & Cameron, 2007; Rees, Knight, & Cleland, 2009).

Most standard setting methods determine a Pass/Fail decision for Borderline grades by identifying a cutoff score
within the borderline range by statistical/mathematical calculations deemed to be objective (Ben-David, 2000;
Cizek, 2012; Cizek & Bunch, 2007). Among the most commonly used methods are the Nedelsky, Ebel, Angoff,
Hofstee, Borderline Group, and Regression methods (Ben-David, 2000; Cizek, 2012; Cizek & Bunch, 2007).
The Nedelsky, Ebel, Angoff and Hofstee methods use expert panels to estimate what a cutoff score should be
(Cusimano & Rothman, 2003; Geisinger & McCormick, 2010; Hurtz & Auerbach, 2003; Kaufman, Mann,
Muijtjens, & Vleuten, 2000; Kramer et al., 2003; Verheggen, Muijtjens, Van Os, & Schuwirth, 2008; Wass et al.,
2001; Wayne et al., 2005), whereas the Borderline Group and Regression methods use only the test scores
without any additional post-examination judgment (Boursicot et al., 2007; Shulruf et al., 2013; Smee, 2001;
Wilkinson, Frampton, Thompson-Fawcett, & Egan, 2003). Methods based on experts’ judgment are susceptible
to judgment bias, and to date no consensus has been reached on an optimal way to achieve high reliability
without recruiting a large number of experts (Ben-David, 2000; Brannick et al., 2011; Chang, Dziuban, Hynes, &
Olson, 1996; Cizek & Bunch, 2007; Hurtz & Auerbach, 2003; Skorupski & Hambleton, 2005; Wayne et al.,
2005).

Determining cutoff scores for Borderline grades by a combination of objective and subjective scores is
common, since many clinical examinations employ marking sheets which include scores for particular
tasks (deemed “objective”) as well as a global rating (“subjective”) (Ben-David, 2000; Cizek & Bunch, 2007;
Roberts et al., 2006; Wilkinson et al., 2003; Wilkinson et al., 2001). Such methods, although deemed to be
reasonably objective, have an inherent flaw: the same sum of all the “objective” scores (i.e. specific
skills) could be classified as “Pass”, “Borderline” or “Fail” on the subjective score (global performance) for
different examinees (Boursicot, Roberts, & Pell, 2006; Boursicot et al., 2007; Cizek & Bunch, 2007; Kellow &
Willson, 2008; Norcini, 2003; Shulruf et al., 2013; Wood et al., 2006). It is therefore not surprising that
significant inconsistencies have been found when these two types of standard setting methods were compared
(Kaufman et al., 2000; Kramer et al., 2003; Lagha, Boscardin, May, & Fung, 2012; Wayne et al., 2005).

1.2 Modern Test Theory and Decision Making

Modern test theories such as item response theory (IRT) have become more commonly used in medical
education (Boulet et al., 2003; Downing, 2003). IRT methods have been mostly used for improving the quality of
written test items rather than determining Pass/Fail cutoff scores for clinical examinations (Downing, 2003;
Schuwirth & Vleuten, 2010), or for helping to calibrate test items before applying other commonly used standard
setting methods (Boulet et al., 2003; Ferdous & Plake, 2008; Grosse & Wright, 1986; MacCann & Stanley, 2006;
Wang, Wiser, & Newman, 2001). The most advanced standard-setting method that uses an IRT framework for
determining a Pass/Fail cutoff score is the Bookmark method (Buckendahl, Smith, Impara, & Plake, 2002;
Karantonis & Sireci, 2006; Lewis, Mitzel, & Green, 1996; Peterson, Schulz, & Engelhard Jr., 2011). Nonetheless,
the Bookmark method has been criticized mainly for being resource intensive and for its use of an arbitrary value (.67
probability of success) to establish the point that is used to rank order items for the judges’ booklets, which may
explain why it has not been widely used for setting cutoff scores in clinical examinations (Karantonis & Sireci,
2006; Lewis et al., 1996). It is also noted that although findings suggest that the Bookmark method is preferable to
the Angoff method (Peterson et al., 2011), concerns remain about judges’ ability to make reliable decisions despite that
additional information (Davis-Becker, Buckendahl, & Gerrow, 2011; Deunk, Van Kuijk, & Bosker,
2014).

1.3 The Objective Borderline Method: An Alternative Method for Decision Making over Borderline Grades 

In response to the abovementioned challenges, Shulruf et al. (2013) recently introduced the Objective Borderline
Method (OBM), which is based upon a measure of difficulty of the examination in question, formed from the
initial results that the students actually obtained on this examination. There are two separate but related and
fairly natural measures of difficulty available (Raykov & Marcoulides, 2011; Sax & Reade, 1964). One seeks to
combine these two measures into a single measure. There are innumerable ways in which this might be done. A
simple, plausible and perhaps intuitive way is to think of the two initial measures (which are both numbers
between 0 and 1 and generated from observed proportions of those two categories) as being notionally the
probabilities of success on two independent tests or experiments. The measure is formed simply as the product of
these two probabilities and may thus be conceptualized as the probability of success on both tests. It must be
emphasized here that these two independent tests are conceptual only. However, they serve as a useful heuristic
guide to our thinking in constructing the combined measure. This combined measure is just an index (similar to
other indices, e.g. BMI) whose validity is determined only by its usefulness. This combined probability or index
is by no means the probability of the occurrence of any actual event (Shulruf et al., 2013).

Explicitly, the initial results consist of a number of Fail, Borderline, and Pass grades. The first notional test
consists of drawing a grade at random from the collection of all Fail and Borderline grades. “Success” is
considered to be drawing a Borderline grade, and the first measure of difficulty is the probability of drawing a
Borderline grade (i.e. the observed proportion of Borderlines among the pool of Borderlines and Fails). The
second notional test consists of drawing a grade at random from the collection of all Borderline and Pass grades.
Here “success” is considered to be drawing a Pass grade (i.e. the observed proportion of Passes among the pool
of Passes and Borderlines). Each of these two tests (achieving Borderline rather than Fail and achieving Pass
rather than Borderline) is a common measure of difficulty when a test includes two categories (Raykov &
Marcoulides, 2011; Sax & Reade, 1964). If the numbers of Fail, Borderline and Pass grades are $n_F$, $n_B$ and $n_P$
respectively, then the probability of success on the first notional test is $P_{t1} = n_B / (n_F + n_B)$ and the probability of
success on the second notional test is $P_{t2} = n_P / (n_B + n_P)$. The combined measure of difficulty is then

$$P_t = P_{t1} \times P_{t2} = \frac{n_B}{n_F + n_B} \times \frac{n_P}{n_B + n_P}.$$

Figure 1 schematically illustrates how the OBM index is calculated.

The OBM utilizes $P_t$ in such a way that it assigns a conceded Pass to a proportion of the Borderline grades equal to
$P_t$, and a conceded Fail to the remaining Borderline grades.
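As a worked illustration of the calculation described above, the following minimal Python sketch computes the combined index from the three grade counts. The function name `obm_index` and the example counts are hypothetical and are not taken from Shulruf et al. (2013); exact fractions are used only to keep the arithmetic transparent.

```python
from fractions import Fraction


def obm_index(n_fail: int, n_borderline: int, n_pass: int) -> Fraction:
    """Combined OBM index: product of the two notional success probabilities."""
    if n_borderline <= 0:
        raise ValueError("the index is only needed when Borderline grades exist")
    # First notional test: drawing a Borderline from the Fail + Borderline pool.
    p_first = Fraction(n_borderline, n_fail + n_borderline)
    # Second notional test: drawing a Pass from the Borderline + Pass pool.
    p_second = Fraction(n_pass, n_borderline + n_pass)
    return p_first * p_second


# Hypothetical counts for illustration only (not data from the study).
index = obm_index(n_fail=10, n_borderline=30, n_pass=60)
print(index, float(index))  # 1/2 0.5
```

With these illustrative counts the two proportions are 30/40 and 60/90, so the index is 1/2 and, under the OBM, half of the 30 Borderline grades would be conceded Pass.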

It is acknowledged that, like all standard setting models, the OBM is derived from some arbitrary premises (Cizek,
1993). However, Shulruf et al. (2013) demonstrated that the OBM is at least as effective as other standard setting
methods (e.g. the Regression and the Borderline Group methods).



1.4 The Overarching Objectives of the Current Study 

The current study introduces a modification to the OBM model which enables making pass/fail decisions for any
type of marks (continuous or categorical) as long as the marks can be initially classified into three categories (Pass,
Borderline and Fail) and the number of Passes is greater than zero. It offers a practical and theoretically
defensible method to determine which of the Borderline grades, within a categorical set of grades, should be
considered as Pass and which should be considered as Fail. The improved model is named the Objective
Borderline Method 2 (OBM2) as it uses two measures (examinee ability and item difficulty) to determine
whether a Borderline grade should be reclassified as Pass or Fail. Unlike the OBM, the OBM2 does not establish a
cut-score but determines whether a Borderline grade should be Pass or Fail on a case-by-case basis. Such a
solution may benefit any panel of examiners who need to make pass/fail decisions over borderline grades for
non-compensatory assessment systems, where a high score in one domain cannot compensate for a low score in
another. A thorough review that took place in the preparation of this study failed to identify any such method.

1.5 The Objective Borderline Method 2 (OBM2) 

The original OBM estimates the combined probability ($P_t$) of being successful in two notional tests based on the
counts of Passes, Borderlines, and Fails ($n_P$, $n_B$ and $n_F$ respectively) for a set of examination scores of a group of
examinees. OBM2 uses the same approach as the OBM but at the item rather than the examination level,
assuming all items are unidimensional (Hattie, 1985).

Consequently, when a group of examinees is assessed using a set of unidimensional items and their
performance is classified as Pass, Borderline or Fail, it is possible to calculate two different combined
probabilities (i.e. $P_t$) for each Borderline grade. The first is based on the particular examinee’s grades across all
items (referred to as the examinee’s $P_t$ and denoted $P_e$); the second is based on the grades for a particular
item across all examinees (referred to as the item’s $P_t$ and denoted $P_i$). Analogous to Item Response Theory
(Kolen & Brennan, 2004), $P_e$ is a measure of the examinee’s ability and $P_i$ is a measure of item difficulty. The
relationship between these two probabilities ($P_e$ and $P_i$) can be used to determine whether the Borderline grade
should be conceded Pass or Fail. This relationship is expressed by a decision index ($P_d$), which is the quotient

$$P_d = \frac{P_e}{P_e + P_i}.$$

When $P_d > .5$ the examinee’s ability exceeds the item difficulty, hence the Borderline
grade should be conceded Pass; when $P_d < .5$ it should be conceded Fail. Note that when $P_d = .5$ the Pass/Fail
decision cannot be determined by this index. In this case the decision must be determined by a pre-specified policy.
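The decision rule above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the authors' implementation: the single-letter grade codes ("F", "B", "P", "E"), the treatment of Exceptional as a Pass when counting, and the function names are all hypothetical choices made for the example.

```python
from fractions import Fraction
from typing import Optional, Sequence


def combined_index(grades: Sequence[str]) -> Fraction:
    """Combined index for one set of grades (one examinee's row or one item's column)."""
    n_f = sum(g == "F" for g in grades)
    n_b = sum(g == "B" for g in grades)
    # Assumption: Exceptional ("E") grades are counted together with Pass.
    n_p = sum(g in ("P", "E") for g in grades)
    if n_b == 0:
        raise ValueError("expected at least one Borderline grade")
    return Fraction(n_b, n_f + n_b) * Fraction(n_p, n_b + n_p)


def decision_index(examinee_grades: Sequence[str],
                   item_grades: Sequence[str]) -> Optional[Fraction]:
    """Quotient of the examinee's index over the sum of examinee and item indices."""
    p_e = combined_index(examinee_grades)  # examinee's grades across all items
    p_i = combined_index(item_grades)      # item's grades across all examinees
    if p_e + p_i == 0:
        return None                        # quotient undefined (no Passes at all)
    return p_e / (p_e + p_i)


def concede(examinee_grades: Sequence[str],
            item_grades: Sequence[str]) -> Optional[str]:
    """Return "Pass", "Fail", or None when the index cannot decide."""
    p_d = decision_index(examinee_grades, item_grades)
    if p_d is None or p_d == Fraction(1, 2):
        return None                        # left to a pre-specified policy
    return "Pass" if p_d > Fraction(1, 2) else "Fail"


# Hypothetical grades: one examinee over nine criteria, one criterion over ten examinees.
examinee = ["P"] * 6 + ["B"] * 2 + ["F"]
item = ["P"] * 5 + ["B"] * 3 + ["F"] * 2
print(decision_index(examinee, item), concede(examinee, item))  # 4/7 Pass
```

Here the examinee's index is 1/2 and the item's index is 3/8, so the quotient is 4/7 and the Borderline grade would be conceded Pass. Returning `None` for the tie and for the undefined case is a design choice of this sketch, so that the pre-specified policy is applied by the caller rather than hidden in the function.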

The current study aims to estimate the validity of the OBM2 by examining what the consequences would be if 
Borderline grades of medical students’ clinical examination were reclassified as Pass or Fail using the OBM2. 

2. Methods 

2.1 Data 

The UNSW Medicine program is a 6-year undergraduate entry program organized into three phases, each
comprising two academic years. At the end of each phase, students must pass a clinical skills examination before
progressing to the next phase. This study used data from the Phase 1 and Phase 2 clinical examinations from five 
cohorts of students. 

Each of the Phase 1 and Phase 2 clinical skills examinations comprises six standardised stations (for more details 
on the curriculum and clinical assessments see: McNeil, Hughes, Toohey, & Dowton, 2006). The students are 
assessed in nine criteria encompassing generic communication skills, clinical history skills and physical 
examination skills. A standard grading sheet is used at each station with additional specific descriptors relevant to 
the station’s tasks. A common 4-point grading system is used for each criterion: Fail, Borderline, Pass and 
Exceptional. The examiners do not provide a global grade for the station. A Pass/Fail decision for each station is 
based on the proportion of Fail and Borderline grades—failing a station is a result of at least two Fail grades or a 
combination of one Fail grade and more than two Borderline grades. A Pass/Fail decision for the examination is 
based on the number of failed stations—students must pass at least three stations. Each grade is also converted to a 
numerical score (with Borderline representing 50% of maximum score); a Fail decision is also made if a student’s 
total numerical score is <50%. Students who fail the Phase 1 clinical skills examination are offered a 
supplementary examination after a period of remediation. Students who fail the supplementary examination are 
excluded from the program. 
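The station and examination rules described above can be expressed as a short Python sketch. It assumes grades are stored as the strings "Fail", "Borderline", "Pass" and "Exceptional", and it takes the total percentage score as an input, because the numerical values assigned to grades other than Borderline are not stated here. The function names are hypothetical.

```python
from collections import Counter
from typing import Sequence


def station_passed(criterion_grades: Sequence[str]) -> bool:
    """A station is failed with >= 2 Fail grades, or 1 Fail plus > 2 Borderline grades."""
    counts = Counter(criterion_grades)
    fails, borderlines = counts["Fail"], counts["Borderline"]
    failed = fails >= 2 or (fails == 1 and borderlines > 2)
    return not failed


def examination_passed(stations: Sequence[Sequence[str]],
                       total_score_pct: float) -> bool:
    """Pass requires at least three passed stations and a total score of at least 50%."""
    stations_passed = sum(station_passed(grades) for grades in stations)
    return stations_passed >= 3 and total_score_pct >= 50.0


# Hypothetical station with nine criteria: one Fail and three Borderline grades.
print(station_passed(["Pass"] * 5 + ["Fail"] + ["Borderline"] * 3))  # False
```

Because the rules are non-compensatory, a high total score does not rescue a student who passes fewer than three stations, and passing every station does not rescue a total score below 50%.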

2.2 Sample

Test data were available from 1,136 students who sat the Phase 1 clinical examination. Of these students, 42 did
not progress to the Phase 2 clinical examination and their grades in the Phase 2 clinical examination were considered
in our analysis as Fail. This inclusion was based on data not presented in this study suggesting that the
discontinuation of those students was due to unsatisfactory performance in their clinical and non-clinical studies in
Phase 1. Thus, this analysis includes all 1,136 students (Y2004, N=210; Y2005, N=229; Y2006, N=226; Y2007,
N=238; Y2008, N=233). Demographic data such as gender, age or ethnicity were not included in the dataset or the
analysis as they were not deemed relevant to the model discussed.

2.3 Analysis 

The first analysis employed factor analysis of raw examination scores within each station to ensure
unidimensionality of the items (Hattie, 1985). Then, within each station the decision index ($P_d$) was calculated,
leading to the assignment of Pass/Fail to each Borderline grade that was originally given to a student for a
performance criterion within a station. Next, based on the OBM2 reclassification of grades, Pass/Fail decisions
for the whole clinical examination (all six stations) were calculated in the way described above (students must
pass at least three stations and total score from all stations must exceed 50%).

The last stage compared the predictive validity of the original grades in the Phase 1 clinical examination with the 
reclassified grades derived by the OBM2. The sensitivity, specificity, positive and negative predictive values, and 
accuracy (overall fraction correct) of the Phase 1 grades for predicting performance in the subsequent Phase 2 
clinical examination were measured (see Table 1) (Bossuyt, 2011). 


Table 1. Definition of true positive, true negative, false positive and false negative (adapted from Bossuyt, 2011)

Pass/Fail decision for Borderline grades     Clinical examination results, Phase 2
in clinical examination, Phase 1             Pass                   Fail
Pass                                         True Positive (TP)     False Positive (FP)
Fail                                         False Negative (FN)    True Negative (TN)

Index                        Definition
Sensitivity                  TP/(TP+FN)
Specificity                  TN/(TN+FP)
Positive predictive value    TP/(TP+FP)
Negative predictive value    TN/(TN+FN)
Accuracy                     (TP+TN)/(TP+TN+FP+FN)



3. Results 

3.1 Suitability of the Data 

The results indicate that the data did not fully meet the criteria for unidimensionality, as in some stations the items
loaded on two factors. Nonetheless, Table 2 suggests that within each station there is only one meaningful
underlying factor, since none of the factor loadings in any of the six stations met the criteria for a discrete two-factor
structure (Pett, Lackey, & Sullivan, 2003) and the variance explained by the first factor was between 28 and 35
percent whereas the second factor explained no more than 6%. Thus, it was decided to carry on with the analysis,
particularly given the high level of internal consistency within each station (Cronbach’s
alpha = .80, .80, .77, .82, .79, .82 for Stations 1 to 6 respectively).
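For readers who wish to reproduce an internal-consistency check of this kind, the following is a minimal sketch of the standard Cronbach's alpha formula using only the Python standard library. The data layout (one row per examinee, one column per criterion) and the function name are assumptions made for illustration; the study's own software and scoring are not described here.

```python
from statistics import variance
from typing import Sequence


def cronbach_alpha(scores: Sequence[Sequence[float]]) -> float:
    """Cronbach's alpha; scores[r][c] is examinee r's numeric score on item c."""
    k = len(scores[0])  # number of items
    # Sample variance of each item across examinees.
    item_variances = [variance([row[c] for row in scores]) for c in range(k)]
    # Sample variance of the examinees' total scores.
    total_variance = variance([sum(row) for row in scores])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


# Hypothetical scores for three examinees on two items.
print(round(cronbach_alpha([[1, 2], [2, 1], [3, 3]]), 3))  # 0.667
```

In the small example each item has variance 1 and the totals (3, 3, 6) have variance 3, giving alpha = 2 × (1 − 2/3) = 2/3.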

Table 2. Factor Matrix (ML)

                        Station 1      Station 2      Station 3      Station 4      Station 5      Station 6
Factor                  1      2       1      2       1      2       1      2       1      2       1      2
Item 1                  .651   -.338   .641   -.174   .686   -.374   .682           .749   -.301   .669   -.058
Item 2                  .599   -.256   .616   -.343   .601   -.090   .673           .652   -.190   .640   -.153
Item 3                  .571   -.125   .586    .070   .545    .165   .616           .609   -.013   .598   -.236
Item 4                  .569    .158   .576   -.280   .540    .270   .599           .551    .246   .580    .192
Item 5                  .561    .340   .563    .037   .538   -.237   .587           .549    .085   .576   -.125
Item 6                  .556   -.098   .562    .292   .498    .110   .540           .479    .381   .574   -.225
Item 7                  .552    .418   .536    .289   .471    .287   .538           .459   -.049   .547   -.097
Item 8                  .550    .071   .533    .268   .468    .185   .531           .437    .094   .544    .329
Item 9                  .491   -.067   .495    .090   .433    .321   .527           .404    .459   .543    .401
Variance explained (%)  32.3    5.9    32.4    5.3    28.7    5.9    34.9           30.6    6.2    34.5    5.2

The average percentage of Borderline grades that were reclassified as Pass (by criterion by station) was 25.8%
(range 0.0-58.8%). Comparing the OBM2 Pass/Fail decisions for the Phase 1 clinical examination with the
original decisions indicates that the OBM2 model was more stringent than the original decision, yet the decisions
made by the OBM2 had a high level of agreement with the “original decisions” (decisions made by the board of
examination within the institute) (Accuracy = .88) (Table 3).

3.2 Models Comparison 


Table 3. Distribution of final Pass/Fail grades by decision model

              Original decision
OBM2          Fail      Pass      Accuracy*
Fail          9         132
Pass          0         995       0.88
Total                             1136

Note. *Accuracy = overall fraction correct (proportion of agreement out of all grades)


The quality of the OBM2 was estimated by comparing the overall clinical examination grades in the Phase 2 
clinical examination with the overall outcomes of the Phase 1 clinical examination as calculated in two ways: by 
the original method and by the OBM2 model. 


Table 4. Distribution of Phase 2 outcomes by Phase 1 outcome by type of decision

                                Phase 2
Phase 1 decision                Pass      Fail
Original decision    Pass       1056      71
                     Fail       4         5
OBM2                 Pass       945       50
                     Fail       115       26

Table 5. Quality indices: OBM2 vs. original decisions

Index             OBM2      Original decision
Specificity       .342      .066
Sensitivity       .892      .996
False Positive    .044      .063
False Negative    .101      .004
Accuracy          .855      .934


The results indicate that the original decision yielded an accuracy of .93 and a sensitivity of .99 but a specificity of
only .07. The OBM2 model was less accurate but, as a more stringent model, it yielded a higher level of
specificity (.34). Based on the original decision, 71 (6.6%) students passed the Phase 1 clinical examination but
failed in Phase 2. In comparison, the OBM2 model passed only 50 (4.6%) such students. The cost of increasing the specificity
was that the OBM2 model resulted in failing 115 (11%) of the students in the Phase 1 clinical examination who were
later successful in the Phase 2 clinical examination.
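The sensitivity, specificity and accuracy figures can be recomputed directly from the Table 4 counts. The sketch below treats a Phase 1 Pass followed by a Phase 2 Pass as a true positive, as in Table 1; the function and variable names are hypothetical, introduced only for this illustration.

```python
from typing import Dict


def predictive_indices(tp: int, fp: int, fn: int, tn: int) -> Dict[str, float]:
    """Standard 2x2 diagnostic indices from true/false positive/negative counts."""
    total = tp + fp + fn + tn
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),          # positive predictive value
        "npv": tn / (tn + fn),          # negative predictive value
        "accuracy": (tp + tn) / total,  # overall fraction correct
    }


# Counts taken from Table 4 (Phase 1 decision by Phase 2 outcome).
original = predictive_indices(tp=1056, fp=71, fn=4, tn=5)
obm2 = predictive_indices(tp=945, fp=50, fn=115, tn=26)

for name in ("sensitivity", "specificity", "accuracy"):
    print(f"{name}: original={original[name]:.3f}, OBM2={obm2[name]:.3f}")
```

For example, the OBM2 specificity is 26/(26+50) ≈ .342 and the original specificity is 5/(5+71) ≈ .066, which matches the values reported in Table 5.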

4. Discussion 

The main objective of this study was to utilise the recently introduced Objective Borderline Method (OBM)
(Shulruf et al., 2013) for supporting pass/fail decisions for students who performed at the borderline level in their
clinical examination. This was achieved by modifying the OBM to incorporate two measures (examinee ability
and item difficulty) for determining whether a Borderline grade should be reclassified as Pass or Fail. In order to
provide robust evidence, this study followed the relevant recommendations for research on assessment from the
Ottawa 2010 Conference (Schuwirth et al., 2011): (a) basing the research on robust scientific theory
(recommendations 7, 8, and 9); (b) taking the modern approach to validity by looking at consequential validity
rather than merely comparing one method with another (recommendations 12, 13); (c) adopting the Item Response
Theory (IRT) conceptual framework in the development of a new method (recommendation 18). We note that
recommendation 18 was only partially followed, as OBM2 applies only one feature analogous to IRT, namely the
comparison of student ability with item difficulty, and it is in no way suggested that IRT models were applied in
this study/model.

The OBM and OBM2 introduce a new concept in the field of standard setting by “legitimising” the category of a
Borderline grade. The underlying assumption is that a Borderline grade is one which indicates that the
examinee’s assessed performance could not clearly be classified as either Pass or Fail, and that this is a category in its
own right (Jalili et al., 2011; Norcini, Shea, & Kanya, 1988). Furthermore, the OBM2 is a plausible solution for
making decisions when the data suggest uncertainty (Draper, 2005; Ramsey, 1926). Nonetheless, the OBM2 is not
a standard setting method in the sense that it does not set any cut-score but only provides an evidence-based indication
of whether a Borderline grade should be conceded Pass or Fail.

The underlying assumption of previous standard setting methods is that there is an inevitable misclassification of 
examinees’ proficiency where some truly proficient examinees are mistakenly classified as not proficient (False 
Negative) and others who truly did not reach the appropriate proficiency level are mistakenly classified as 
proficient (False Positive) (Cizek, 2012; Cizek & Bunch, 2007). The OBM and OBM2 methods address this 
concept of misclassification by determining the range of Borderline grades as the range where the level of 
competency could not be classified without any doubts as either clear Pass or clear Fail (Shulruf et al., 2013). This 
method of classification applies to the determination of the Pass/Fail scores (the definition of in/competency) and 
the actual classification of examinees’ performance by the examiners (Kane, 1994). Skorupski and Hambleton
(2005), for example, demonstrated that the majority of panelists engaged in the item mapping standard-setting
method reported having difficulty distinguishing between performance categories.

Thus, enabling examiners or judges to use a “Borderline” category and accepting the uncertainty of such a category
might be a more appropriate approach than forcing them to make a decision based on limited information. The
actual decision whether a Borderline grade should be reclassified as Pass or Fail would then be based on all the data
points available, which are deemed to be more reliable.

The question of where one should set the cutoff point needs to be decided: either employ a stringent policy, granting the final Pass only for
the clear Pass (minimizing the number of False Positives), or take a more lenient policy, granting a final Pass
to those who did not clearly fail (minimizing False Negatives). This could be decided either by
the agreed opinion of panelists who apply judge-based standard setting or by policy makers who decide which
test-based (panelist-free) standard setting method is to be used (Kane, 2013). Since each test-based standard setting method
applies a different mathematical procedure, the results are expected to differ somewhat even when applied to the
very same data (Wood et al., 2006). Consequently no standard setting method, including the OBM/OBM2, could
be absolutely objective.

In this study we investigated the impact of the OBM2 with respect to a policy that aims to maximize specificity and
minimise the number of False Positives. The results clearly demonstrate that both indices would have been
improved (in accordance with our pre-determined policy) had OBM2 been implemented. Note that progression to
Phase 2 was based on the original decisions (decisions made by the Board of Examination) and thus this
comparison is somewhat problematic. However, this limitation would apply to any study using real data which
resulted in similar decision making. It is noteworthy, however, that had the OBM2 been used, the specificity would
have increased from 7% to 34%, with a trade-off of a drop in the sensitivity from 99% to 89%, which
overall, in our view, is a preferable outcome for the chosen policy.

It is evident that the OBM2 model is more stringent than the original decisions made by the Board of Examination
for determining Pass/Fail and would increase the number of students failing the clinical skills examination.
However, this is expected and perhaps even desirable. Clinical examiners tend to avoid failing students and
trainees, particularly as they give borderline students the benefit of the doubt (Cleland et al., 2008; Dudek et al.,
2005; Morton et al., 2007; Rees et al., 2009). Such practice is argued to have the potential for major adverse impact
on medical practice (Albanese, 1999). Moreover, in their comprehensive review of sources of bias in clinical
performance rating, Williams, Klamen, and McGaghie (2003) summarized compelling evidence suggesting that
the tendency for leniency is deeply embedded in clinical assessment practices, with little to no impact of
training on such examiner bias. Consequently, applying the OBM2 model, where a Borderline grade is made a
legitimate and well defined category (when neither clear Pass nor clear Fail may confidently be granted), would help
examiners avoid the leniency bias and minimize the passing of incompetent examinees (Kane, 1994).

Currently, most grading criteria describe what a competent examinee should demonstrate but fail to define what
constitutes borderline performance. Even when borderline performance is defined, the descriptors are vague and
indecisive, and correlate poorly with the checklist scores (Pell, Fuller, Homer, & Roberts, 2010). The description of
clear Pass and clear Fail criteria aligned to measurable teaching and learning objectives would provide
transparent expectations to examinees.

An important contribution of the OBM2 is that it provides Pass/Fail decisions for Borderline grades when the 
grading system does not use continuous scales but only ordinal categories (e.g. Fail, Borderline, Pass, Excellent). 
This is an advantage of the OBM2 compared to other standard setting methods, which do not have that capacity, 
particularly when the use of a continuous scale makes little sense if any. Moreover, many of the previous standard 
setting models assume that the categories (for example in OSCE stations) are points on an interval scale (Boursicot 
et al., 2007; Cizek & Bunch, 2007; Kramer et al., 2003) although that assumption receives little support from the 
statistical and educational measurement literature, particularly when the number of the ordinal categories is fewer 
than six (Agresti, 2010; Torra, Domingo-Ferrer, Mateo-Sanz, & Ng, 2006). 

How acceptable is the OBM2? Although difficult to judge at this stage, there are some indications that it should be
easily acceptable. First, it is very easy to use and requires no mathematical or statistical skills (see Appendix 2).
Second, the OBM2 does not add any cost to current programs, except the time required to revise the Pass/Fail
criteria as described above, which is negligible compared to other methods using panels of experts (Cizek &
Bunch, 2007). Third, the analogy between OBM2 and IRT (both use item difficulty and examinee ability) might be
appealing. This analogy is a major advance in the field because by definition a Borderline grade by itself includes
very little information, only that it is neither clear Pass nor clear Fail. All other information relevant to the
examinee’s performance on the “borderline item” is embedded in their performance on all other items
encompassed within the same dimension (Cronbach, 1951). Furthermore, the inclusion of item difficulty in the
OBM2 method acts as a correction for examiners’ leniency/stringency bias. The Pass/Fail decision is determined
by the comparison of two calculated probabilities. As clearly observed in [Equation 2], the harder the item the
smaller the $P_i$ and thus the greater the $P_d$ (and vice versa for an easier item). This correction enhances the fairness of the
examination, and reduces concerns over examiners’ bias, which has its most critical impact on borderline
performance.

Obviously the OBM2 raises some challenges which need to be addressed. The OBM2 compares two calculated
measures conceptualized as probabilities. However, a small number of items may affect its resolution, which, if
not fine enough, might impair its effectiveness. In this study there were nine items (criteria) for each station,
which yielded 25 unique values of $P_t$, providing sufficient resolution for calculating $P_d$. Six items, however,
would yield only 12 unique values of $P_t$. Hence we recommend that the OBM2 should be used only when six or
more items are included. Nonetheless, further investigation is needed to determine the impact of the number of
items on the OBM2 outcomes.

More limitations are related to the study itself. This study used test data from five past clinical examinations (five 
cohorts). The grading sheets defined Fail criteria based only on curriculum objectives which were deemed to 
suffice for this pilot study. Since no practical decision was made based on the OBM2, this deviation from the 
suggested practice is minor. It is therefore recommended that further studies take a prospective approach to ensure
that Pass and Fail criteria are defined as described earlier in this paper. The other minor limitation is that the 42 
students who did not progress to Phase 2 were deemed to have failed the clinical examination in that Phase. This 
decision was made as most of these students did not continue due to poor performance in Phase 1. Since no 
information on performance in Phase 2 was available, any imputation of data to the clinical examination results of 
Phase 2 would anyway be based on performance in Phase 1. Therefore, given the low number of failures in the 
programme, it was believed that including those students in the analysis and assigning them a Fail outcome in the 
Phase 2 clinical examination would be the most plausible approach. 

An important feature of the OBM2 is that it is applicable to non-compensatory assessment systems, where high 
score in one domain cannot compensate for a low score on another. No previous study was found in the literature to 
address decision making over borderline grades for such assessment systems. It is acknowledged that this is the 
first step only and further research may yield better formulae/indices to support decision making over borderline 
grades either within or beyond the OBM framework. 

5. Conclusion 

Michael Kane’s definition of validity provides some important insight into the research on standard setting:
“Validity is a property of the interpretations assigned to test scores, and these interpretations are considered valid if 
they are supported by convincing evidence” (Kane, 2013, p. 56). Like all other methods, the OBM2 has 
advantages and shortcomings and they have been discussed above in detail. Whether the evidence provided in this 
study to support the validity of the OBM2 is sufficiently convincing is left to the readers to judge. Nonetheless, 
unless empirically proved otherwise, the OBM2 is a plausible method for supporting pass/fail decision making for 
borderline grades, particularly when a non-compensatory assessment system is applied and the risk of passing 
incompetent examinees who received Borderline grades is of a major concern. 